Wine Appreciation by The Numbers

by Anna Signor

What is “good” wine? This is a captivating subject, since so much real money rides on judgements that many consider elusive.

Join me in exploring a data set of over 6,000 samples of Portuguese Vinho Verde, juxtaposing objective measurements with the quality ratings of experts. Can we decode what makes an expert give a wine a high quality mark? Are they biased toward reds or whites? Are “better” wines less sweet? Let’s find out.

(I recommend pairing this read with a glass of your favorite wine.)

Sample the data

Let’s get to know our data set. You can find the complete information furnished by the publishers here. In their words:

…two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

I merged the two sets and added a column called type to indicate red or white, so we have the columns:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "type"

You can refer to the same link as above for detailed definitions of each field.
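
The merge itself was done in R, but the logic is simple. Here is a minimal Python sketch of the same idea; the toy rows below are made up, standing in for the two CSV files:

```python
# Toy rows standing in for the red and white wine CSV files (values invented).
reds = [{"fixed.acidity": 7.4, "quality": 5},
        {"fixed.acidity": 7.8, "quality": 5}]
whites = [{"fixed.acidity": 7.0, "quality": 6}]

def merge_with_type(reds, whites):
    """Stack both sets into one list, tagging each row with its wine type."""
    return ([{**row, "type": "red"} for row in reds] +
            [{**row, "type": "white"} for row in whites])

wines = merge_with_type(reds, whites)
```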


Univariate Plots Section

The first thing I’d like to know is whether I have to account for a bias toward red or white in terms of quality. Let’s boxplot them side by side.

We learn a lot from this. There seems to be no bias: for both types of wine, most values fall between 5 and 6, with a median of 6.

I am curious about the outliers. Let’s check the distribution in a side-by-side histogram.


And what do they look like combined?


Both look like a nice normal-like distribution. In the back of my head, I am putting a pin in this: if I try to write a predictor for the quality, I need to be careful. We already know this distribution, so any valuable predictor needs to be better than just guessing around the mean, because guessing around the mean could produce deceptively good results without actually adding value beyond a simple study of the distribution.
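
To make that pin concrete, here is a Python sketch of the naive baseline any model has to beat: always guessing the most common grade. The grades below are toy values, not the real column:

```python
from collections import Counter

# Toy quality grades standing in for the real column (values invented).
qualities = [5, 5, 6, 6, 6, 7, 4, 6, 5, 6]

def baseline_accuracy(grades):
    """Accuracy of a 'model' that always predicts the most common grade."""
    _, count = Counter(grades).most_common(1)[0]
    return count / len(grades)
```

With a distribution concentrated around one or two grades, this trivial predictor already scores well, so any model must clearly exceed it.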

I wonder which measured attributes do not behave that way.


This is how it looks for whites:


And for reds:


Univariate Analysis

From this one-dimensional analysis I learned a lot. Firstly, the quality grades are in a normal-like distribution, so I will need to be very strict before celebrating the accuracy of any predictive model. I am interested in exploring chlorides and residual sugar because of the extreme distributions I am seeing, and I am interested in the volatile acidity in red wines, because of the popular wisdom around red wines having to “breathe”.

It is very hard to get any answers without exploring the relationships among the features, and between each feature and the quality. We do that next.


Bivariate Plots Section

One feature that stood out to me was the volatile acidity. This is how I understood it from their documentation: volatile acidity is the kind that dissipates over time after a bottle of wine is opened, as opposed to the acidity that “belongs” in the wine. I always heard that the best red wines don’t need to breathe: you can drink them as soon as the bottle is opened. (And I never heard of white wines needing to breathe at all.) I wonder if that is related to the volatile acidity, and if we can see it in the data.

Let’s first plot the volatile acidity against the quality.

Not very informative. Some summarizing is in order.

There seems to be a clear trend where the quality decreases with the volatile acidity. We are on to something.

What happens if we include the whites? I would expect the relationship to be different from the reds, because I never heard of white wine having to breathe.

It seems that, in whites, not only does the quality not decrease as volatile acidity increases, but the volatile acidity is also significantly and consistently lower than in reds. It also seems that low-quality wines have higher volatile acidity than their type mates, no matter what.

A popular conception is that sweet wines are “bad”. Let’s take a stab at that one.

Nope, that one is a bust: at least for Vinho Verde, it is not true. All we can say is that the white wines in this family are consistently higher in sugar content, which surprised me a lot.

Let’s look at the impact of free sulfides on quality:

Not very illuminating. The biggest piece of information is the difference between reds and whites. It seems like, at least by itself, the free sulfide content has no clear impact on the quality.

Volatile acidity vs acidity:

There seems to be something of a correlation, but only in reds. I’d also guess that it is conditioned on other variables, which may be responsible for the outliers. I am just going to take some guesses at what they could be; maybe I’ll get lucky. So I am plotting just the red part of the same graph above, with the color being each one of my guesses. If the outliers tend to appear in a color that stands out, that will likely be my lurking variable.

Citric Acid:

Sugar content:

Sulphates:

Alcohol:

Chlorides:

This one looks more interesting, but it is not a jackpot.

Density:

This one just looks well correlated with the fixed acidity. Let’s look at that:

Density vs fixed acidity:

Yes, there is clearly a correlation.

Moving on, I’d like to see if there is an apparent correlation between alcohol content and quality:

Now, this is interesting. This seems to be a very relevant factor.


Let’s look at each variable besides quality and type plotted against quality (grouped in averages) and pivoted by type. This should give us a good high-level view of the correlations between the variables and their contributions to the quality feature.

It certainly looks like there are a lot of relevant features with good correlations with the quality, as well as features that make up similar shapes, indicating that some of them may be correlated to each other. For example, the free sulfide and sugar content look similar, and the sulphate and volatile acidity seem to be inversely correlated.

Bivariate Analysis

The first observation that stands out to me is the “separation” between red and white wines. Only one feature, alcohol, behaves as if type were not a discerning factor. Several of them have a similar shape (indicative of the correlational behavior with the quality), but are separated by a gap. Most concerning, in terms of continuing to analyze both types together, some features seem to have different behavior altogether. In some cases, such as sulphates and volatile acidity, a feature that matters for red wine is non-discerning for white wines; in other cases, such as pH, they have inverse behavior, where for one type the correlation is positive and for the other, negative. What this tells me is that if I continue with this approach and try to build one model to predict all wines, every time I tune the parameters I could be improving the predictive power for one group while damaging it for the other.

For this reason, coupled with my own personal preference, I will proceed with an exploration of red wines only.

It certainly looks like there are a lot of relevant features with good correlations with the quality. The ones that most stand out to me are alcohol, citric acid, free sulfides and volatile acidity.

Some pairs of features make up similar shapes, indicating that some of them may be correlated to each other. For example, the free sulfide and total sulfide look very similar, and the sulphate and volatile acidity seem to be inversely correlated.


Multivariate Plots Section

The first thing I would like to check are the correlations between each feature and quality, now just for red wines:

It looks like a lot of these features are strong and discerning. Because of this, I will try a decision tree. Decision trees work best when the predicted value is categorical or ordered. So, let’s call a wine “ge” for “good or excellent” when the quality grade is 6 or more, and “mb” for “mediocre or bad” otherwise. Looking back at our distribution, this seems reasonable.


##   grade_f grade alcohol sulphates   pH density total.sulfur.dioxide
## 1      mb    mb     9.4      0.56 3.51  0.9978                   34
## 2      mb    mb     9.8      0.68 3.20  0.9968                   67
## 3      mb    mb     9.8      0.65 3.26  0.9970                   54
## 4      ge    ge     9.8      0.58 3.16  0.9980                   60
## 5      mb    mb     9.4      0.56 3.51  0.9978                   34
##   free.sulfur.dioxide chlorides residual.sugar citric.acid
## 1                  11     0.076            1.9        0.00
## 2                  25     0.098            2.6        0.00
## 3                  15     0.092            2.3        0.04
## 4                  17     0.075            1.9        0.56
## 5                  11     0.076            1.9        0.00
##   volatile.acidity fixed.acidity quality
## 1             0.70           7.4       5
## 2             0.88           7.8       5
## 3             0.76           7.8       5
## 4             0.28          11.2       6
## 5             0.70           7.4       5
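
The labeling rule itself is one line. A Python sketch of the equivalent logic, using the cutoff of 6 chosen above:

```python
def grade(quality):
    """Label a wine "ge" (good or excellent) at quality 6 or more, else "mb"."""
    return "ge" if quality >= 6 else "mb"
```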

As you can see, we now have a grade, which can only assume two values, and a factor column corresponding to it. Now, let’s split the data into train and test sets, at a 75% rate, and take a peek at the train set just to make sure everything looks normal.

##   grade_f grade alcohol sulphates   pH density total.sulfur.dioxide
## 1      mb    mb     9.4      0.56 3.51  0.9978                   34
## 2      mb    mb     9.8      0.68 3.20  0.9968                   67
## 3      mb    mb     9.8      0.65 3.26  0.9970                   54
## 4      ge    ge     9.8      0.58 3.16  0.9980                   60
## 5      mb    mb     9.4      0.56 3.51  0.9978                   34
##   free.sulfur.dioxide chlorides residual.sugar citric.acid
## 1                  11     0.076            1.9        0.00
## 2                  25     0.098            2.6        0.00
## 3                  15     0.092            2.3        0.04
## 4                  17     0.075            1.9        0.56
## 5                  11     0.076            1.9        0.00
##   volatile.acidity fixed.acidity quality
## 1             0.70           7.4       5
## 2             0.88           7.8       5
## 3             0.76           7.8       5
## 4             0.28          11.2       6
## 5             0.70           7.4       5
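
The split was done in R; here is an equivalent Python sketch of a 75/25 shuffle-and-cut split. The seed value is arbitrary, for reproducibility only:

```python
import random

def train_test_split(rows, train_frac=0.75, seed=42):
    """Shuffle a copy of the rows and cut at the given train fraction."""
    shuffled = rows[:]                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```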

Let’s look at a representation of a decision tree using only the volatile acidity and sulphates features. Here, I made a deliberate decision to use features that show strong correlations, and to use one positively and one negatively correlated (or apparently so), to give us the best chance of seeing a decent tree. Using two features is in no way the best modeling strategy, but it yields a good graphical representation of how decision trees work. We will eventually include more features, and that will look unreadable as a graph.
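
As a sketch of how such a two-feature tree reads, here is a hand-rolled Python stump; the split thresholds below are invented for illustration, not the values the fitted ctree actually chose:

```python
def predict_grade(volatile_acidity, sulphates):
    """Toy two-feature decision tree; thresholds are illustrative only."""
    if volatile_acidity > 0.55:          # high volatile acidity -> likely "mb"
        return "mb"
    return "ge" if sulphates > 0.60 else "mb"
```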


To understand how the decision tree is working, let’s consider an example, from the test set.

rev(test[8, ])
##    grade_f grade alcohol sulphates   pH density total.sulfur.dioxide
## 50      mb    mb     9.2      0.58 3.32  0.9954                   96
##    free.sulfur.dioxide chlorides residual.sugar citric.acid
## 50                  12     0.074            1.4        0.37
##    volatile.acidity fixed.acidity quality
## 50             0.31           5.6       5

This particular data point is a wine with a volatile acidity of 0.31, sulphates at 0.58, and a quality rating of 5, which makes it a “mediocre or bad” wine per our definition of grade. The diagram below shows how the tree above would predict this wine’s quality.

Picking just two features is a simplistic way to proceed (although I exercised judgement in picking them), but just out of curiosity, let’s see what accuracy we got from it, by testing the model with data that was not used to build the tree, that is, the test data.
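
Testing here means comparing the tree’s predictions on the held-out rows against the true grades; a minimal Python sketch of the accuracy computation:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)
```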

## [1] 0.6675

Not bad. Now, let’s build a tree with all the features, and test the model on data this tree has never “seen”.

## [1] 0.74

The accuracy is 74%. This exploration ends here; a further project would involve tuning the parameters of the ctree with the ctree_control class to improve these out-of-the-box results.

Multivariate Analysis

After plotting all the features against the quality, it appears that all of them are relevant. I made the decision to create a new feature called grade, which segregates the records into two categories, and to attempt to build a classifier.

As part of this exploration, I built a starter tree with just two features, and it was a surprise to see it yield an accuracy of 67%. The final tree is an out-of-the-box binary tree trained on 75% of the data, scoring an accuracy of 74%.


Final Plots and Summary

Plot One - why I chose to separate white from red

Observe how, for many of the features, the correlational behavior is dramatically different. Emphasis on average pH, sulphate content, and acidity. The exploration of these relationships can proceed in a more focused manner if the two types are treated separately. There is no statistical or data reason why I chose reds; both whites and reds look to have interesting and relevant data. It was a personal choice.

Plot Two - alcohol content vs quality

In the grouped violin plot above, a trend is noticeable: the red wines with the highest alcohol content tend to score higher quality ratings. Separating the wines into “good or excellent” and “mediocre or bad” by color, the grey horizontal line, which severs the plane at the median alcohol content, leaves a majority of good or excellent wines above it, and a majority of mediocre or bad wines below it.

Plot Three - volatile acidity vs quality

In similar fashion to the previous chart, a trend seems to present itself. In this case, the quality decreases with the volatile acidity. Higher quality wines are less and less likely to have a high amount of volatile acidity, and none of the wines graded good or excellent have more than 1.1 volatile acidity.

Reflection

This is a data set rich with features of significance, and while patterns emerge and are readily visible, more sophisticated analysis would be needed to produce a well-performing predictive model. I posit the main reason for this is a fair amount of interdependency among the predictive features. We found at least one case of strong correlation, between alcohol content and density, and I imagine there are more. The most interesting part of this exploration, for me, was the volatile acidity and its correlation with the score.

An obvious next step, in my opinion, would be to fine-tune and perhaps better structure a classifier. PCA, for example, comes to mind, given the apparent dependency amongst features. My predictive model of choice, in this case, would be a random forest classifier with PCA, and I would start by using the engineered feature grade, which assumes only one of two values, before moving on to more granularity.